Lab 3

Please note that there is a file on Canvas called Getting started with R which may be of some use. This provides details of setting up R and Rstudio on your own computer as well as providing an overview of inputting and importing various data files into R. This should mainly serve as a reminder.

Recall that we can clear the environment using rm(list=ls()) It is advisable to do this before attempting new questions if confusion may arise with variable names etc.

Example 1

In this example we want to determine whether there is any difference between the driving reaction times after drinking water, coffee and alcohol in the Kruskal-Wallis (Reaction Times) dataset. Please follow the steps below:

  • Firstly, we input the dataset into R and check it using the ``head” command (recall this provides the first 6 entries). We use the readxl package to import the Excel file.
library(readxl)
Reaction<-read_excel(file.choose())
head(Reaction)
# A tibble: 6 × 2
  `Reaction Times` `Drink Groups`
             <dbl>          <dbl>
1             0.37              1
2             0.38              1
3             0.61              1
4             0.78              1
5             0.83              1
6             0.86              1
  • We next perform a test of normality on the standardized residuals to determine which method should be used. If the residuals are normally distributed and we have homogeneity of variances (Levene’s test) then we would use ANOVA. The following code creates and performs tests of normality on the residuals.
anova_model<-aov(Reaction$"Reaction Times"~Reaction$"Drink Groups")
summary(anova_model)
residuals<-rstandard(anova_model)
shapiro.test(residuals)

    Shapiro-Wilk normality test

data:  residuals
W = 0.89219, p-value = 0.005437
library(nortest)
lillie.test(residuals)

    Lilliefors (Kolmogorov-Smirnov) normality test

data:  residuals
D = 0.14947, p-value = 0.08544
  • Clearly the residuals are not normally distributed, therefore we must use the nonparametric equivalent of the ANOVA procedure, i.e. the Kruskal-Wallis H-test. In order to determine the correct hypotheses, we next investigate the shapes of the distributions of the groups using histograms:
DrinkTypes<-factor(Reaction$"Drink Groups",c(1,2,3),labels=c('Water','Coffee', 'Alcohol'))
par(mfrow=c(1,3)) 
hist(Reaction$"Reaction Times"[DrinkTypes=='Water'], xlab='Reaction Time', 
     main='Histogram for Water', breaks=seq(from=0,to=4, by=0.2)) 
hist(Reaction$"Reaction Times"[DrinkTypes=='Coffee'], xlab='Reaction Time', 
     main='Histogram for Coffee', breaks=seq(from=0,to=4, by=0.2))
hist(Reaction$"Reaction Times"[DrinkTypes=='Alcohol'], xlab='Reaction Time', 
     main='Histogram for Alcohol', breaks=seq(from=0,to=4, by=0.2))

  • From the histograms we see that the groups seem to have similar shaped distributions, therefore use the following hypotheses:

    • \(H_0:\) The medians of all groups are equal;
    • \(H_1:\) At least one group has median not equal to the others.
  • Next we perform the Kruskal-Wallis H test.

kruskal.test(Reaction$"Reaction Times"~Reaction$"Drink Groups")

    Kruskal-Wallis rank sum test

data:  Reaction$"Reaction Times" by Reaction$"Drink Groups"
Kruskal-Wallis chi-squared = 16.322, df = 2, p-value = 0.0002856
  • As we have a significant result we undertake a Dunn post hoc test with Bonferroni correction.
install.packages("FSA")
library(FSA)
dunnTest(Reaction$"Reaction Times"~factor(Reaction$"Drink Groups"), method="bonferroni")
Dunn (1964) Kruskal-Wallis multiple comparison
  p-values adjusted with the Bonferroni method.
  Comparison         Z      P.unadj        P.adj
1      1 - 2 -2.095735 3.610569e-02 0.1083170553
2      1 - 3 -4.039053 5.366736e-05 0.0001610021
3      2 - 3 -1.943318 5.197773e-02 0.1559332006
  • You should see that there is a significant difference between drinks 1 and 3, i.e. between water and alcohol. We therefore report medians using the following code:
tapply(Reaction$"Reaction Times",Reaction$"Drink Groups",median)
    1     2     3 
0.845 1.445 2.250 

Exercise 1

The Diets dataset contains data on the weight loss of 3 groups of participants who took park in 3 different diets. Determine whether there is any difference in the effectiveness of the diets, fully justifying the method used.

Example 2

In this example we will continue the example of determining whether a bonus scheme improves lateness in employees. Recall the data below:

Employee Baseline Month 1 Month
1 16 17 11
2 10 5 2
3 7 8 0
4 13 9 5
5 17 2 2
6 10 10 9
7 11 6 5

We will now use a Friedman test to determine whether there is any difference in the medians of these groups. Please follow the steps below in R:

  • We first input the data:
Baseline<-c(16,10,7,13,17,10,11)
Month1<-c(17,5,8,9,2,10,6)
Month2<-c(11,2,0,5,2,9,5)
LatenessData<-data.frame(Baseline,Month1,Month2)
LatenessData
  Baseline Month1 Month2
1       16     17     11
2       10      5      2
3        7      8      0
4       13      9      5
5       17      2      2
6       10     10      9
7       11      6      5
  • Next we check the normality of the residuals using the following code. We first need to upload the dataset Work Lateness.csv, which in a different format. (It is not as straightforward as in the Kruskal-Wallis case.)
LatenessNew<-read.csv(file.choose(), header=T)
LatenessNew
attach(LatenessNew)
anova_model3 <- aov(Lateness~Bonus+Error(EmployeeA))                                      
anova.model3.pr <- proj(anova_model3)
residuals<-anova.model3.pr[[3]][, "Residuals"]
shapiro.test(residuals)

    Shapiro-Wilk normality test

data:  residuals
W = 0.94283, p-value = 0.2478
  • The residuals seem to be normally distributed, however we continue with the Friedman test as the sample size is small (and we will see shortly that the data are skewed). We next investigate the shapes of the group to determine the correct hypotheses.
par(mfrow=c(1,3)) 
hist(Baseline, xlab='Late Days', main='Histogram for Baseline', breaks=seq(from=0,to=20, by=2)) 
hist(Month1, xlab='Late Days', main='Histogram for Month 1', breaks=seq(from=0,to=20, by=2))
hist(Month2, xlab='Late Days', main='Histogram for Month 2', breaks=seq(from=0,to=20, by=2))

  • We see that the groups seem to have the same shaped distribution, hence we compare medians in the hypotheses. The histograms are all skewed, providing further evidence that non parametric techniques are not suitable. We now proceed with Friedman’s test. Note that we have to make a matrix out of the data to perform the test.
LatenessMatrix<-matrix(c(Baseline, Month1, Month2),ncol=3, nrow=7,dimnames = list(1 : 7, c("Baseline", "Month1", "Month2")))
LatenessMatrix
  Baseline Month1 Month2
1       16     17     11
2       10      5      2
3        7      8      0
4       13      9      5
5       17      2      2
6       10     10      9
7       11      6      5
friedman.test(LatenessMatrix)

    Friedman rank sum test

data:  LatenessMatrix
Friedman chi-squared = 10.231, df = 2, p-value = 0.006004
  • Since we have a significant result we perform a post hoc test. We use the Nemenyi post-hoc tests in the PMCMRplus package.
install.packages("PMCMRplus")
library(PMCMRplus)
kwAllPairsNemenyiTest(LatenessNew$Lateness~LatenessNew$Bonus)

    Pairwise comparisons using Tukey-Kramer-Nemenyi all-pairs test with Tukey-Dist approximation
data: LatenessNew$Lateness by LatenessNew$Bonus
  0      1     
1 0.0794 -     
2 0.0072 0.6646

P value adjustment method: single-step
alternative hypothesis: two.sided
  • You should come to the conclusion from this output that there is a significant difference between Baseline and Month2. We next report their medians using the code:
median(Baseline)
[1] 11
median(Month2)
[1] 5

Exercise 2

Determine whether there are any differences between the groups Overall score (Video A), Overall score (Video B), Overall score (Video C) and Overall score (Demo D) in the Public Awareness SPSS dataset.

Hint: Before performing the Friedman test you will need to create a matrix of the relevant variables. The following code may help you to do this:

AwarenessMatrix<-matrix(c(Awareness$TotalAGen, Awareness$TotalBdoc, Awareness$TotalCOld,
        Awareness$TotalDDEMO), nrow=20,ncol=4, dimnames = list(1 : 20, c("A", "B", "C", "D")))

You will then need to create another dataset with one column containing all the awareness scores and a corresponding label column. I would suggest exporting your matrix created above using the write.csv() command and making these changes in Excel, then re-import your new dataset.

Back to top